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Abstract 

Weighted likelihood, in which one solves Horvitz-Thompson or inverse probability weighted 
(IPW) versions of the likelihood equations, offers a simple and robust method for fitting models 
to two phase stratified samples. We consider semiparametric models for which solution of 
infinite dimensional estimating equations leads to ^/N consistent and asymptotically Gaussian 
estimators of both Euclidean and nonparametric parameters. If the phase two sample is selected 
via Bernoulli (i.i.d.) sampling with known sampling probabilities, standard estimating equation 
theory shows that the influence function for the weighted likelihood estimator of the Euclidean 
parameter is the IPW version of the ordinary influence function. By proving weak convergence 
of the IPW empirical process, and borrowing results on weighted bootstrap empirical processes, 
we derive a parallel asymptotic expansion for finite population stratified sampling. Whereas 
the asymptotic variance for Bernoulli sampling involves the within strata second moments of 
the influence function, for finite population stratified sampling it involves only the within strata 
variances. The latter asymptotic variance also arises when the observed sampling fractions 
are used as estimates of those known a priori. A general procedure is proposed for fitting 
semiparametric models with estimated weights to two phase data. Several of our key results 
have already been derived for the special case of Cox regression with stratified case-cohort 
studies, other complex survey designs and missing data problems more generally. This paper 
is intended to help place this previous work in appropriate context and to pave the way for 
applications to other models. 
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1 Introduction 



Nevmanl (1938) 



Two phase stratified sampling, also known as double sampling, was introduced by 
to estimate the population mean of a target variable that is costly or difficult to measure. At phase 
one a relatively large random sample is drawn and measurements are made on an auxiliary variable 
that is correlated with the target variable but easier to measure. At phase two measurements on 
the target variable are made for a subsample drawn randomly, without replacement, from within 
strata defined by the auxiliary variable. Neyman showed that the o ptimal, design unbiased linear 
estimator of the population mean is the Horvitz-Thompson 1)19521 ) estimator that weights each 
observation by the inverse of the probability of its selection into the phase two sample. 

Two-phase stratified sampling designs can dramatically reduce the costs of regression modeling 
when the strata depend on (correlates of) both outcome and explanatory variables. A common 
method of estimation is "weighted exogenous sampling maximum likelihood", here simply 



Weighted Likelihood or WL, in which one maximizes the inverse p robability weighted ( 
sum of log-likelihood contribu tions from the phase two observations (jManski and Lerman . 



K albfleisch and Lawless 



(ISkinner et al 



1 98<Ts :t 



PW) 



1977 



1988). Equivalently, one may solve an IPW version of the score equations 



. 4). Al though easy to implement, WL estimators are sometimes seriously 



inefficient (Robins et al 



199 



They may still be of interest, however, because even when the 



model is wrong they consistently estimate the finite population parameters t hat wou l d be o btained 



1989; 



Binder 



1992). Fully 



by fitting the model to complete phase one data (|Xie and Ma nski 
efficient estimators are available for logistic and other parametric regression model s in situations 
where the phase one data consist only of stratum frequencies. See, for example, 
(2003) and the references cited therein. 

The asymptotic properties of WL estimators of E uclidean parameters in parametric models 



follow readily from standard results for M-estimators (|van der Vaarti . I1998I . Chapter 5). WL may 
also be used for estimation of both Euc lidean and infinite dimensional paramet ers in semi parametric 
models, for which the paradigm is Cox (J1972J) proportional hazards regression. iLinl (2000J ) developed 
asymptotic results for both regression coefficients and baseline cumulative ha zard when fitt i ng the 



Cox model to survey data including those obtained using two phase sampling. 



Borganetal 



( 2000J ) 
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obtained the same results for the regression parameters when fitting the Cox model to data from 
exposure stratified case-cohort studies, in which all subjects who have a failure event (the cases) 
are sampled at phase two. One purpose of the present paper is to develop a modern theory of WL 
estimation in semiparametric models that encompasses these previous results, helps to interpret 
them and paves the way for further applications. We also explore the relationship between results 
based on finite population stratified sampling at phase two and those based on i.i.d. variable 
probability sampling with sampling weights estimated using information from phase one. 

2 Notation, Assumptions and Problem Statement 

Suppose Pq„ denotes a probability distribution in a semiparametric model for a random variable 
X £ X, where 9 £ B C M p is the Euclidean parameter and 77, taking values in some arbitrary 
space H, is the nonparametric one. Let Pq = Po nr]n denote the distribution from which X is 



actually sampled. Following closely §25.12 of 



van der Vaartl (|1998h . suppose maximum likelihood 



(ML) estimators (9, 17) are obtained by solving the system 



tfivi(M) = V N l e , n = 

Vm(0,v) = W N B 6tV h - Pe, v Be, v h = V h e H. (1) 



Here ig jTj is the p-dimensional likelihood score for 9, Bq^ is the score operator (|Begun et a l.. 1983) 
working on an infinite dimensional class TL of directions h from which paths of one-dimensional 
submodels for r\ may approach 770, and Pat is empirical measure based on the i.i.d. sequence 
X\, . . . , Xn. Set £q = 4» ,f?o an d Bq = B eo rto . 



Suppose the following assumptions, which slightly strengthen the hypotheses of 



van der Vaart 



(1998, Theorem 25.90), are satisfied so that yN{9 — 6*0,17 — %) is asymptotically Gaussian: 



Al for (9,7]) in a ^-neighborhood of (9q,t]q) the functions ie :V and {Bg jV h, h £ TL} are contained in 
a Po-Donsker class T\ 

A2 Po\\£e, v ~ 4|| 2 and sup heH P Q \B e , n h - B h\ 2 converge to as (9, if) -> (0 O , r?o); 
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A3 the map ^ = : 8 x h R p x £°°(TC) with components 

*i(e,rj) = P ie, v 

V 2 (p,Tj) = P Bo >v h-Pe >ri Be >v h, h£H, (2) 

which is the expectation of the random map = O^JVij ^N2) in ©, has a Frechet derivative 
at (^0)%) that is continuously invertible on its range. 

A4 {6, if) is consistent for (#o,%) an d satisfies ^^(9^t)) = 0. 

Assumption A3 is typically established by showing that the information operator BqBq is 
continuously invertible and thus that rj is estimable at a ^/N rate. This is the most restrictive 
assumption, but one that leads quickly to our main result. 

With two phase sampling, however, X is not observed for all TV subjects. At phase one we 
observe only a coarsening X = X(X) of X plus auxiliary variables U S U that serve to determine 
the sampling strata. X is fully observed for subjects sampled at phase two. Let W = (X, U) G W = 
X x U denote the variables potentially available for everyone, but in fact fully observed only for 
those in the phase two sample, and V = (X, U) £ V = X xU denote the variables actually observed 
for everyone. We write Po for the distribution of W = (X, U) and denote by S^r = • • • , Wjv] 

the sigma field of information, also referred to as the complete data, potentially available for the 
./V subjects. A sequence of binary indicators (£i, . . . , £at) shows which subjects are selected (£j = 1) 
at phase two for observation of X{. We consider t wo pro bability models for the indicators £j. 



In the first, known as Bernoulli or Manski-Lerman ()1977i ) sampling, each phase one subject is 
examined in succession for the value of Vi and the indicator £j is independently generated with 
Pr(£j = l|Wj) = Pr(^j = 1 1 V^) = 7To(Vi) where ttq is a known sampling function. This preserves 
the i.i.d. structure for the observations Vi, Note the crucial missing at random (MAR) 

assumption: ttq depends only on what is observed at phase one. We write Qo for the distribution 
of (Wi, £i). If V is partitioned into J strata V\ U • • • U Vj, stratified Bernoulli sampling corresponds 
to the special case where tto(v) = Pj for v G Vj. We assume that all J strata are sampled with 
positive probability, or more generally that 

< a < tt (v) < 1 for veV. (3) 
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Even though the sa mpling fractions are known, it is advisable to estimate ttq in order to increase 



1994). We consider estimation of ttq using a parametric model 



the efficiency of WL ( R obins et al 
in §6. 

The second sampling model corresponds to Neyman's original design and is usually closer to 
actual practice. Here we observe the entire phase one sample at once and record the stratum 
frequencies Nj = Ylt=x lyj(^i) for j = 1, . . . , J. At phase two samples of size rij < Nj are drawn 
at random, without replacement, from each of the J finite phase one strata. Using now a doubly 
subscripted notation where £j j denotes the indicator variable for i l subject in stratum j, the 
essential features of this design are that, conditionally on S^r: (i) for j = 1,..., J the random 
variables ■ ■ ■ , Cj'iVj) are exchangeable with Pr(^ j j = 1|Sat) = nj/Nj] and (n) the J random 
vectors ■ ■ ■ , £jNj) are independent. Our problem is to estimate (6,r]) using the incomplete 
observations Vi on everyone and the complete observations Xj on subjects sampled at phase two. 



3 Weighted Likelihood Estimator 



i N f- 



i=l 



(4) 



WL estimates are obtained by solving Horvitz-Thompson (IPW) versions of the likelihood 

equations. Define the inverse probability weighted empirical measure by 

N 

N 

where denotes Dirac measure placing unit mass on Xj and 

{ffoiVi) f° r Bernoulli sampling 

n 
jf: if Vi G Vj for finite population stratified sampling. 

Then, instead of (^Q) we solve 

ttfrxCM) = Ve,r, = 

VJf 2 (6,ri) = F%B 0!V h - Pe, n B e , v h = for all h € H. 



(5) 



In view of the MAR assumption, for any integrable function / : X 
Bernoulli or finite population stratified sampling, 



and under either 



E^f(X t ) = E 
vr,- 



E 



7T, 



f(x t 



Ef(Xi), i = l,...,N, 



so that EP^/ = EF N f = P f. Consequently, the random map ^ = O^jvi**^) defined by 
has the same expectation as the random map ^jv m ©, namely \P = (^1,^2) as m ©• 
The implication is that the assumptions A1-A4 made to guarantee the asymptotic normality 
of the ML estimator based on complete phase one data are also the assumptions needed to 
guarant ee the asymptotic normality of the WL estimator based on two phase data. Indeed, van der 
Vaart's (1998) Theorem 25.90, or more precisely his Theorem 19.26 of which it is a restatement, 
applies virtually without change to the Bernoulli sampling setup. The Donsker class J- in Al 
is modified to T = {[£/iro(V)]f(X), f £ J 7 }. Since under the hypothesis (jSJ) it is the product 
of a fixed bounded function with the Donsker class the fact that T is Donsker for the joint 



distribution Qq of (W, £) follows from Ivan der Vaart and Wellnerl ([1996L example 2.10.10). The 
random map ^jv corresponding to the estimating functions (jHJ is ordinary empirical measure Qn 
for {(Wj, i = 1, . . . , N} applied to the unbiased estimating functions (C/^o)^9,n an d (C/^o)Bo >ri h. 
A4 will generally follow from (J3J) and the arguments used to establish consistency for the complete 
data ML estimator, together with (151). A 2 and A3 are unchanged. The more general Theorem 



3.3.1 of 



van der Vaart and Wellnerl (J1996J) is needed, however, to deal with the non i.i.d. data 



induced by finite population stratified sampling. To verify its hypotheses, we first must establish 
weak convergence of the empirical process based on PJ^. 

4 Weak Convergence of the IPW Empirical Process 



Two phase stratified sampling resembles the bootstrap in that it involves random sampling from 
the finite, albeit incompletely obser yed, population \X-\ , . . . , X^}. Here we use results on weighted 



b ootstrap empirical processes from 
in 



Prsestgaard and Wellnerl (|1993l . Theorem 2.2), as incorporated 



van der Vaart and Wellnerl (|1996i . Theorem 3.6.13), to demonstrate weak convergence of the IPW 
empirical process G^r = v^VCPjy — Pq) f° r finite population stratified sampling. First note that, 

t h 

with the subscript j,i denoting the i of Nj observations in stratum j, 



N 



J j 



3=1 



i=l 



3=1 



11; 



(6) 
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where 



P U = ^:E^^ 

3 i=i 

is a finite sampling empirical measure for the j stratum. Similarly one can express the ordinary 
empirical measure as 

1 J 



N 



where 

N 



v w = w.J2 s Xi ^ w = ^ E (8) 

denotes the empirical measure for the j^ 1 stratum. Justification of the second (doubly indexed) 
form is given in Appendix A. 

Combining (jEJ) and (J7J), and letting Gat = y/N(Fjsr — Pq) denote the standard empirical process, 
we have 



7 N = VN(FJj-Po) 

= Vn(f n -p ) + Vn(f* n -f n ) 



where 



U^V^Kn,-^,^) (10) 



3 

is the finite sampling empirical process for stratum j. 

The first term in Q converges to the Po-Brownian bridge process G indexed by the Donsker 
class T mentioned in Al. Let -fo|j(") = E(-|V G Vj) denote Po conditional on membership in 
stratum j, i.e., for measurable Ad X, P y(A) = P [Aly. (V)]/uj with Vj = Poly^V), and let Gj 
denote the P |j~Brownian bridge, also indexed by J-. Our goal is to establish the weak convergence 
of the remaining terms on the RHS of Q. If as N — > oo the sampling fractions converge with 



7 



m/Nj — » pj, the assumption on t he exchangeable "weights" ■ ■ ■ ,S,j,N 3 ) in equation (3.6.8) of 
van der Vaart and Wellnerl (|1996) holds trivially with 



1 Nj 

Jf. E ~ /'A 1 " Pi)- 



i=l 



Furthermore, with ~» denoting weak convergence in i 00 ^), \/Nj (P 



; see Appendix 



B for the proof. Thus their Theorems 3.6.13 and 1.12.4 imply that, for almost every sequence of 
complete data, N . y/pj(l — P~j)Gj. Conditionally on E^r, the processes G^ N . are mutually 
independent because of the independence of the in different strata. Furthermore, by virtue 

of the fact that they al so are (unconditionally) uncorrec ted with Gn = VN(Fn — Pq), which 



follows along the lines of 



van der Vaart and Wellnerl ((1996] . Corollary 2.9.3), or that (conditionally) 



they have the same limiting distributions for almost all sequences of data, the vector of processes 
Gq A^, • • • , Gj Nj ) converges weakly to the vector of independent Brownian bridge processes 
(G, Gi, . . . , Gj). Consequently 



G 



N 



U of 



Borganetal 



(2000). 




(11) 



This result formalizes and extends Proposition 1 of ISelf and Prentice! (|1988T ) and the arguments in 



5 Asymptotic Distributions of the WL estimator 



We apply Theorem 19.26 of 



van der Vaartl ((1998) to conclude that, under Bernoulli sampling, 



Similarly, using Theorem 3.3.1 of 



" "o 
fj-Vo 



7T0 



B n h 



+ Op(l). 



(12) 



van der Vaart and Wellnerl ()1996l ) together with the development 



of the previous section, we conclude that for finite population stratified sampling 

o-e 



+ Op(l). 



(13) 



We have already argued that the hypotheses of the first theorem follow from appropriately modified 
versions of A1-A4. Together with the weak convergence of GJj just established, they also suffice 



van der Vaart and Wellner 



for th e second theorem. In particular, the stochastic condition (3.3.2) of 
( 1996) follows from Al and A2 together with the proof of their Lemma 3.3.5 applied to each of 
&N, ®l,JVi > ■ ■ ■ i ^l,Ni - 

In practice attention is usually focused on inferences for the Euclidean parameter 9. To derive 
a general expression for the asymptotic variance of 9 we further assume 



A5 admits a partition as in equation (25.91) of Ivan der Vaartl (|1998h where the information 
operator BqBq is continuously invertible. 

Following closely the arguments in §25.12 of van der Vaart, we calculate from ()12l) that under 
Bernoulli sampling 



N(9 - ) = Gjv— $o + Op(l) 
whereas from (|13|) under finite population stratified sampling 

y/N0-e o ) = G$ r io + o p (l), 
where in both cases £q denotes the efficient influence function 



and 



IQ = 1 



h = Po 



-1 



I-B (B* B y l B: 



o *o«o 



is the efficient information. Since Pq^o = 0, moreover, both Q14J1 and (|15|) may be expressed 



N(9 - 6 ) 



1 N P 

V -/V • , TTi 
i=l 



(14) 
(15) 

(16) 
(17) 

(18) 



which expansion constitutes the principal result of this paper. 

Under Bernoulli sampling with known ttq the asymptotic variance is therefore 



Yax A VN{9 



Var 



Var E ( —i 

7T0 



X ) + E Var ( — £ 

TTq 



X 



Var(^) + E 
9 



JL_Var(£|X) 

1 - 7T 



7TQ 



-if ) . 



(19) 



In the special case of stratified Bernoulli sampling, with 7Tj = TTo(Vi) = pj for Vi € Vj, this becomes 

v i +E-i i ^^(ir). (20) 

3=1 Pj 

On the other hand, from and (|15|) . the asymptotic variance under finite population stratified 
sampling is 

J , _ 

^ 1 + E^'^ 1Va ^^), (21) 
3=1 Pj 

where Varj(/) = Po\j(f® 2 ) — P®^(f). Comparing the last two expressions shows the substantial 
potential gain from keeping track of the stratum frequencies for the phase one data. 

6 Bernoulli Sampling with Estimated Weights 

Let Vo denote an additional stratum, possibly null, such that £j = 1 for Vi G Vo- Introduction of 
this special stratum with po = 1 does not affect the previous development; in particular, equations 
<HHI)-(ETJ) continue to hold. For Vi £ Vq suppose 

Pr(& = l\Xi, V; a) = Pr(& = l\V f , a) = n a {Vi) < 1 (22) 

where a £ H C 1' is a parameter to be estimated by maximum likelihood from the phase one 
observations {Vj, i = 1, . . . , N} not in Vq. W e assume sufficient re gularity in the model for a, e.g., 



van der Vaartl (|1998T ). so that the ML estimator d is 



to satisfy the hypotheses of Theorem 5.21 of 
consistent and asymptotically normal with influence function 

l~g = l V g fPolvg ( f 2 , ) 'tto fr^ v (23) 
V O 7r (l-7r )y 7r (l-7r ) 

Here for V G Vq, the complement of Vo, ^o(V) = 7r ao (V) is the true sampling function while no(V) 
denotes the q- vector of partial derivatives of TT a (V) with respect to a evaluated at a = ao. If 9(a) 
denotes the WL estimator under two phase Bernoulli sampling with "known" sampling function 
fta(V), then from (|2*3|) and (|T%|) we have 

, §(a ) -9 \ ^( FfriQ \ 
N\ \ 0) ) =Vn[ JV +Op(l)- (24) 



a - a J \ Hn(-o 
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Furthermore, with jfi = 7r(X^; a) for Vi G Vg otherwise 7Tj = 1, we show in Appendix C that under 
some further mild assumptions regarding n a (y) 



N{Fh-V%)£o 



P 1 £o < 



JV(d-ao) + Op(l). 



(25) 



The joint asymptotic normality of (9(ato),a) that follow s from (|24|). together with the Taylor 
expansion l|25j). are precisely the hypotheses used bv lPiercd (1982) to deduce that y/~N[9(a) — 9q] ~^ 
Z where Z £ M. p is mean zero Gaussian with covariance matrix 

52 x -1 



Var A ViV (§(a) - 9 ) = Var ( ^-I ) - fUvff^- ( P)l 



7Tn 



Pnl 



OJ-V, 



*0# 



7T \ °7ro(l-vr )y " v "° vr 
A matrix calculation shows that, when Q26JI is evaluated for stratified Bernoulli sampling 



(26) 



7T Q = 7T a (y) 



1, V € V 

atj, V € Vj, j = 1,... , J, 



the asymptotic variance for the WL estimator 9 with estimated sampling probabilities cL- = rij/Nj 
is identical to the finite population sampling variance (|21|) with pj = aj.o = Vuxirij/Nj. 

Two possibilities present themselves for estimation of the terms in (|26|) . Let 7Tj = ^(V^) for 
Vi £ Vq else 7Tj = 1. Then, using (|T9*|) . we could estimate the first term by 



Var I — £ 



r-l , 1 CiC 1 - n) ~ ®2 



1=1 



the expression in the middle of the second term by 



PqIv, 



7T n 



A/ 



°7T (l-^o) N ~[ 



and similarly for i-b^o^oV^o)- A more empirical approach, however, would be to use the 9 and a 
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influence function contributions themselves to estimate these terms as in 

N 



Yar ( —£ 



^E(|4.,(a-.)) , 



17T, 







A' 



7T 



iV 



N 



i=l \ 



7T; 



7Ti(l - Tti) 



and 



Pnl 



7Tn 



OJ-V, 



7T (1 - 7T ) 



1 

N 



N 



i=l 



7Tj(l - 7Ti) 



■»2 



The resulting asymptotic variance for 9 may be recognized as the comprising the residual sums 
of squares and of cross products from the least squares regressions of each the p components of 
the 6 influence function contributions to which subjects not in the phase two sample 

contribute 0, on the q components of the esti mated a influence function contributions Q23JI . to 



which subjects having Vi £ Vq contribute 0. See 



Henmi and Eguchil (|2004h for a recent discussion 



and interpretation. This suggests the following estimation procedure: 

1. Estimate a from the phase one data and compute the estimated sampling fractions 7Tj. 

2. Estimate 6 and rj from the phase two data by WL, using the inverse 7Tj as known weights. 

3. Regress each component of the influence function contributions for 8 on those for a. 

4. Estimate Var^(#) as the matrix comprising the residual sums of squares and of cross products 
from these regressions. 



Therneau and Grambschl (|2000l . p. 166), who cited earlier work by 



Pugh et al 



( 19921 ) . suggested 



this procedure for the special case of Cox regression, to which we now direct our attention. 



7 Application to the Cox Proportional Hazards Model 



Our development of the Cox model follows closely that of 



van der Vaartl (|199a §25.12) where 



X = (A,T,Z) with T=min(T, C) a censored failure time, A = lpxq the failure indicator and 
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Z £ MP a vector of covariates. The Euclidean parameter is the p- vector of regression coefficients (3 in 
the linear predictor z(3. The nonparametric parameter rj = (A, G, Gz) has three infinite dimensional 
components: A(-) = j Q X(s)ds the baseline cumulative hazard function, assumed differentiable; 
G(t\z) = Pr(C < t\Z = z) the conditional distribution of the censoring time; and Gz, the marginal 
distribution of the covariates. We introduce the usual notation for the "at risk" process Y(t) = 
l[T>t] an d the event counting process N(t) = Alpxt] an d we make the standard assumptions: (i) 
that the true failure time T and C are independ ent given Z\ and (ii) t hat there is a finite maximum 
censoring time r such that Pr[K(r) = 1] > 0. Ivan der Vaar t (1998) makes some further "partly 



unnecessary" assumptions to simplify his development, namely that the covariates Z are bounded, 
that G and Gz have densities as indicated and especially that Pr(C > r) = Pr(C = r) > (see 
discussion in §8). Writing the density for x = (5,t,z), with z a row vector, as 



*"A(t) 



^X(t)(l-G(t-\z))] [g{t\z)\ l - S g z {z\ 



(27) 



and noting that G and Gz factor out of the complete data likelihood. Ivan der Vaarti (|1998l ) considers 
ML estimation for (/3,A) only. With 7i denoting various subsets of the space BV[0,r] of bounded 
functions of bounded variation, he develops the following explicit expressions for the (3 score vector, 
the A score operator that maps functions h £ 7i to functions of the data, its adjoint (but only 
evaluated for the (3 scores) and the information operator that maps 7i onto itself: 



tp A {x) = 5z-ze z ^A(t) 
B[3 )A h(x) 



5h{t) - e zf3 I hdA 
Jo 

P PA Y{t)Ze zp 
/ l (t)P /3)A y(t)e z ' 3 . 



(28) 
(29) 



These are used to calculate the efficient scores 



= S[z-m{t;p)}-e zP [ [z - m(s; p)] dA(s) 
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and efficient information 



h = h - PqBo (-Bo-Bo) 1 -B^o 

= P (e Zf} ° £ [Z - m(t; f3 )f 2 Pr(T > t\Z)dA (t) 

respectively, where I = Pq1 11 and m(t; (3) = (t; (3) / '5 (0) (t; (3) with 

S^(t;/3) = P e z ^Y(t) 
S^(t;f3) = P Ze z ^Y(t). 

To fit the Cox model by WL to two phase stratified samples, first define IPW estimators of 
the two quantities just considered by £(°)(t; (3) = PJje z/3 Y(t) and S^(t;/3) = P n N Ze z/3 Y(t). By 
definition the WL estimators solve 



^i(AA) 
^ N2 (f3,A)h 







w N Bp A h = for all h 6 H, 



(30) 
(31) 



where we have used the fact that Pp^Bp^h = 0. Substituting 

7 / S 1 [s<t] 

for h in 1)3 ljl and solving using (|29j) shows that, for fixed (3, the cumulative hazard function that 
partially maximizes the weighted likelihood and, as is easily checked, satisfies Y^B^^h = for all 
h, is 



N 



mo - - 



6 d^i(s) 



(T; 0) N ti k **S(f > )(s;0 ) 
This may be recognized as an IPW version of the so called 
expression into Q3U|) and evaluating using ((2*5)) yields 



(32) 



1 N f- 

** N1 {p t kp) = F* N A[Z-m(T;(3)} = 



Breslowl Jl974) estimator. Inserting this 

SW(T i; (3 



Z; 



0. 



5(°)(T i; /3)_ 

which is the IPW Cox "partial score" equatio n. Its solution , toget h er with (l""2l . a re the estimators 



proposed for Cox regression by 



Binder! U222), 



Pugh et al 



(1992), 



B organ et al 



(2000, Estimator 
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iD. lLinl (|2000D and others for a variety of complex sampling and missing data problems. Using the 
results of this paper, the large sample properties of (/3, A^) follow from those already developed 
for the ML estimators with complete data, which are given by the same equations with £j = 7Tj = 
l,i = l,...,JV. 



8 Discussion 



The two phase stratified sampling designs considered here are quite flexible in that the phase 



one strata may be formed using all availab 



probabilities. This is in the spirit of 



e infor matio n and sam pled with arbitrary positive 



Binder! dl992h and 



general complex sample survey designs. Others (Bo rgan et al 



Lira (120001) , who considered even more 



2000; 



Kulic h and Lin , 



2004) have 



restricted their attention to covariate stratified versions of the case-cohort design, whereby all 
subjects who fail are sampled at phase two for complete covariate ascertainment. Although this 
may well be an efficient design when the failure rate is low, the assumption that £ = 1 whenever 
A = 1 is often unnecessary and may sometimes be unduly restrictive. Not only does it limit 
application when the phase one population has large numbers of both failures and non-failures, it 
also does so when the sampling has been carried out for one failure type but it is of interest to 
evaluate another. When following patients enrolled in a clinical trial, for example, all deaths may 
be sampled as "cases" but it may later be decided to analyze the data also in terms of "event-free 
survival". In other contexts, biological samples may turn out out to be non-informative so that 
data are still missing for substantial numbers of subjects, including failed cases, who are sampled 
at phase two. Provided one is willing to make the standard MAR assumptions, WL methods 
as described herein may still be used by determining the stratum frequencies for subjects having 
complete data at phase two and using them to estimate the sampling weights. 

The major drawback of WL estimation is its lack of statistical efficiency. Efforts to address thi s 

( 19941 ) . 



deficiency with Cox r e gression have been made by se veral authors including 



Kulich and Linl (jgOOJ) 



Nan et al 



Robins et al 



Scheike and Martinussen 



( 20041 ). Most 



( 20041 ) . iNanl (|2004l ) and 

of these methods are relatively recent and involve sufficiently complex calculations, or sufficiently 
restrictive assumptions, that none have yet seen widespread use. These limitations are certain 
to decline with advances in computing hardware and software, making more efficient estimation 
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methods more widely available. In the meantime, the WL estimation procedure outlined at the end 
of §6 offers a relatively simple and robust alternative. It is likely to remain the method of choice 
for many survey statisticians for the reasons mentioned in the introduction, namely, their interest 
in finite populat i on pa rameters defined as solutions to ML estimating equations. As emphasized 



by 



Robins et al 



( 199J), in view of the interpretation of (|26j) as a residual sum of squares, inclusion 
of additional variables in the model 1)22(1 for it can only enhance the efficiency of 9 estimation. 
When the sampling probabilities vary, as in finite population stratified sampling, inclusion of the 
stratum factors in the model is essential to avoid bias. Finer stratification, or the inclusion of 
auxiliary variables in the model for tt, serves the cause of efficiency. Equation (|21j) suggests that 
such additional variables would be most valuable if they could somehow be chos en to be highly 



corre 



Kulich and Lin 



ated with the efficient scores. The doubly weighted estimator developed by 
(200J) for exposure stratified case-cohort studies is intriguing in that it uses a separate set of (time- 
dependent) weights for each covariate. A preliminary analysis is conducted to estimate quantities 
that resemble within stratum conditional expectations of partial score contributions given the phase 
one data, and these are used to form the weights. An extension of their approach to more general 
two phase stratified sampling designs would be of considerable interest. 

This paper is limited in application to semiparametric models that satisfy the rather stringent 
assumptions A1-A4 of §2. Even in the case of Cox regr ession, these have b een established 



only under the "partly unnecessary" conditions imposed bv Ivan der Vaarti (Jl998|, §25.12.1). His 
assumption that everyone still "on-study" is censored at the common time r would apply to 
situations in which time t referred to calendar time, everyone was entered on study at t = and 
there was a common closing date at t = t. It would not apply, however, if subjects were entered on 
study at various calendar times but withdrawn on a common closing date, and t was taken to be 
"time-on-study" . Nor would it apply if t was "age" and subjects both entered and exited the study 
at various ages. We look forward to further work that relaxes these assumptions, in particular to 
a determination as to whether or not the general approach extends to Cox re gression with time- 
depe ndent covariates and repeated failure events under standard assumptions ( Andersen and Gill 



1982). 



In his Appendix 



Linl (|200Cl ) remarks 
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"To our knowledge, there does not exist a general theory on the conditions required 
for the ti ghtness and weak convergence of H orvitz-Thompson processes. However, the 



results of 



van der Vaart and Wellnerl ([19961 . §§2.9, 3.6, 3.7) can be applied to possibly 



stratified simple random sampling and can potentially be extended to other survey 
designs." 

One purpose of this paper has been to carry out in detail the program mentioned for stratified 
random sampling. We conjecture that our fundamental equation (|18j) applies to Horvitz-Thompson 
estimators for other complex sampling designs, and work is in progress to explore these extensions. 
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9 Appendices 

In Appendices A and B we establish two results slightly more general than needed for the 
development in Section 4. (See the end of Appendix B for the special case required.) The notation 
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in these two appendices should be understood to be independent of the that in the body of the 
paper. 

Appendix A. A Representation of Stratified Sampling. 

Suppose that (ft,A,P) is a probability space and W : (ft, A) — ► (W,B). Write P w for the 
measure induced by W on (W, B); in the notation of section 2, P w = Pq. Suppose that W\, . . . , Wj 
is a (measurable) partition of W: 

(a) W j £B,j = l,...,J; 

(b) Wj n Wj' = for j + f; and 

(c) U/ =1 W,- = W. 

We will assume that P(W £ Wj) = Pj > for j = 1, . . . , J. 
Now consider a new probability space (ft\A\V^) where 



and random variables A = (Ai,...,Aj), W\, ...,W] defined thereon as follows: for = 




...,c4)€fit, 



A(ur ) = A(wJ) ~ Multinomialj(l, (pi, . . . ,pj)) 





(33) 



Now define a random variable W"t : (ft^A^) —> (W, B) by 



W\J) = A^WM) + ■■■ + Aj(J )xl(Jj). 



Note that A, W{, . . . ,W 



are independent by construction. 



Proposition A.l = W on (W,B). That is, P wt = P w as measures on (W,B). 



20 



Proof. First note that 



pt(yrt e yy.) = P t(wt G Wjt A . = !) 



= Pt(w4e W i )P t (A i = l) = l-p i =p j (34) 

using independence of A and Wj , the fact that Wj takes values in VVj with P^-probability 1, and 
P' (Aj = 1) = pj by the definition of P T . 

Now let B £ B. Then since pj > for j = 1, . . . , J, 

j=l jf=l J ' 

= V -p.- by (El 

= ^^ nW fW. B by© 
i=i 
J 

= ^P wr (BnW 3 ) = P H/ (B)=?(W f G5). 

□ 

If W\, . . . , Wjv are i.i.d. P 117 , then we can represent the Wj's in terms of (Aj, W^, . . . , WjJ, 
i = 1, . . . , N, i.i.d. as (A, W\, . . . , Wj) as described in proposition A.l. It follows that 

1 N 

3 i=l 

J AT 

-V, 

W.J2 S ^ (35) 



1 J N 



3 j'=i i=i 



by relabelling the Wj^'s and where TVj = X^i=i Aw on the right side is independent of the Wj^s. 
This yields the promised doubly indexed form of the stratum - specific empirical measure in terms 
of independent Wj/s distributed according to P u where, for B £ B, 

Po{Bl w .) 



Po\ 3 {B) 
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Appendix B. Proof of weak convergence of the stratum-specific empirical process 

Let fj,Nj be as defined in 



i i=l 
where 

iV- 1 ^- = P JV (1 W .) Po(Wj) = Vj > 0. 

Proposition B.l. If T is Pq— Donsker and Vj > 0, then T is Pqu— Donsker on stratum Wj in the 
sense that 

&j,Nj = VNj{V j>Nj - P |i) - % m £°°(.F) (36) 

where Gj, defined by 

%(/) = ^7 1/2 Gp ((/ - PoiiC/))!^), / G ^(^), (37) 

is a P |j"Brownian bridge process. 
Remark 1. Note that 

Var(Gj(f)) = vfPb [(/ - P 0b -(/)) 2 l Wi ] = Var^f) = Var(f(W)\W G W,-). 

Remark 2. The proposition implies that the process y/~Nj(Wj,Nj — Pqu) behaves asymptotically 
the same as that of a sample of fixed size drawn from the conditional distribution Po\j- 

Proof of the proposition. First proof. By the discussion at the beginning of section 2.10.4, 
page 200, van der Vaart and Wellner (1996), Tj = {flw 3 ■ f S J 7 } is Pq— Donsker, and hence the 
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collection Tj = {/lyv- : f € J-U {1}} is also Pq— Donsker. Now we write 



%(¥ hNj f-P QVj f) = ^Nj' ' 



Nj [ GjvC/lwJ G A r(l >Vj )Po(/lw j 

JV 1 JV,-/JV (iV 7 /iV)P (W i ) 

{G^aiw^-Gjvaw^PoijC/)} 



1 



-^=G N ((f - P olj (f))l Wj ) 
' "-o((/- J Pob-(/))lw J )^Gp 0|3 (/), 



and, in fact, 



Gpo((/-P |i(/))lw,): /G^}={Gp 0|j (/): / € ^}. 



Second proof. By the second representation of the stratum-specific empirical measure IP^jv- 
as fj t Nj = N^ 1 Yli^i i where the Wj/s are i.i.d. Po\j, it follows that the empirical process 
Gj t N j = y/Nj(¥j 7 N- — Po\j) is J us t the empirical process of i.i.d. Wj/s, but with a random sample 
size Nj independent of the Wj/s. Since Nj/N — > z/j > 0, it follows from theorem 3.5.1, page 339, 
van der Vaart and Wellner (1996), that G^jv- ~^ Gj in i°°{J-) where Gj is a -fo|j — Brownian bridge 
process as before. □ 

In the application of the results of Appendices A and B in section 4 we take Wi , . . . , Wj to 
be the measurable partition of W induced by the partition Vi, . . . , Vj of V (i.e. Wj = V -1 (Vj) 
for j = 1, . . . , J where V(W) = (X(X), U)). Moreover, the Donsker class T in Proposition B.l is 
taken to be a Donsker class of functions of X only rather than functions of W = (X, U). This is 
exactly what is needed for the development in section 4. 

Appendix C. Proof of equation (|25|) . Besides the consistency and asymptotic linearity 
(J23I) for a assumed in §6, we further assume that < a < ir a (v) as in © and that 
1 1 -ttM 



-(a-ao) < V'WI" - a | 1+C (38) 



7Ta(«) 7Tq:o( U ) ^ofa) 

for a in a neighborhood of cto where £ > and ^ satisfies Eip 2 (V) < oo. The second assumption 
will typically follow from the first provided that Tr a has a continuous second derivative. For 
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example, suppose that 7r a is given by a logistic regression model with linear predictor v T a where 
v = v(v) £ M g . Then Taylor's formula with remainder shows that the LHS of (|38j) equals 



(a — «o) vv (a — ao) 



holds with C = 1 provided e v 



with a* on the line segment between a and ao- Thus the condition 
= K a (v) /[I — ir a (v)} is bounded away from and V has finite fourth 



6 6 



moment. It follows that 

i ^ 

-^lvg(V-)^o(^) 



V 



1=1 



_J i I jgW 



(d - ao) 



1 * 



i=l 
V 



MVi) 



MV) 



d — ao) 
(a - a ) 



where by ©, the similar assumption for 7TQ, and (|38j) . 



A' 



i=l 



1 



1 



< 



1 1 



N 



^Ew)to 





MVi) Tao(^i) ^(1^) 

Id — ao| 1+ ^ 



(d - a ) 



i=l 



(39) 



= O p (l)|d — ao||d — ao|^ 
= O p {l)O p {N- l ' 2 )o p {l). 

Multiplying through (|3*9*|) by \/lSl, we conclude that l|2*5|) holds by virtue of \J~NRn = o P (±) and the 
strong law of large numbers. 
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